gpt_bigcode: added FusedSDPA kernel #1138
Conversation
Force-pushed from 80da185 to 416ad8a
(Screenshots: generation stats for the original implementation vs. FusedSDPA)
Review thread on optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py (outdated, resolved)
@mgonchar, thank you for your PR.
- Please run `make style` prior to future submissions; it takes care of code formatting fixes.
- Please verify that bigcode/starcoder, which also uses this model file, runs correctly for you. I am seeing the error `RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generic failure].` for `python run_generation.py --model_name_or_path bigcode/starcoder --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --bf16 --use_flash_attention`. I suspect this has to do with the "starcoderbase" check: it works if that is substituted with "starcoder", but we need a better match, perhaps because we want to avoid starcoder2 (see the sketch after this comment).
- Testing: please add tests for starcoder and starcoderbase with the flash attention options in tests/test_text_generation_example.py.

I will do another pass after we resolve these.
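A minimal sketch of the kind of check being discussed, assuming the intent is to match both starcoder and starcoderbase checkpoints while rejecting starcoder2 (a different architecture); the helper name and matching logic are hypothetical, not the PR's actual code:

```python
# Hypothetical helper illustrating the naming pitfall: "starcoder2" contains
# "starcoder" as a substring, so it has to be excluded explicitly before the
# broader match.
def uses_gpt_bigcode_flash_attention(model_name_or_path: str) -> bool:
    name = model_name_or_path.lower()
    if "starcoder2" in name:
        return False  # starcoder2 should not take the gpt_bigcode path
    return "starcoder" in name  # matches "starcoder" and "starcoderbase"
```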
Force-pushed from eb41c76 to 815c896
@vidyasiv I've updated this PR based on your feedback. Please have a look.
Minor typo. I was able to run tests, and so far LGTM.
Review thread on optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py (outdated, resolved)
Force-pushed from 815c896 to 4fbb0bb
Please re-run `make style`.
Force-pushed from 4fbb0bb to 41edbba
@vidyasiv done
@regisss, please take a look.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Added support for the following options to gpt_bigcode (the starcoder class of models): `use_flash_attention`, `flash_attention_recompute`, `flash_attention_fast_softmax`, `flash_attention_causal_mask`
- Updated the test for the starcoder model
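As a hedged usage sketch, here is how these options might be passed at generation time, assuming they follow the generation-kwarg pattern used by other optimum-habana models (setup is abbreviated; vanilla transformers without the Gaudi adaptation would reject the extra kwargs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Abbreviated setup; on Gaudi the model would first be adapted via
# optimum-habana (e.g. through run_generation.py) and moved to HPU.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_flash_attention=True,            # route attention through FusedSDPA
    flash_attention_recompute=True,      # trade extra compute for lower memory
    flash_attention_fast_softmax=True,   # use the fast softmax variant
    flash_attention_causal_mask=True,    # let the kernel apply causal masking
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```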
Force-pushed from 41edbba to 7aa7a8e
PR rebased. I've rechecked the rebased code and found no regressions. @regisss, please have a look.
LGTM!
Added support for the following options to the gpt_bigcode (starcoderbase) model:
- `use_flash_attention`
- `flash_attention_recompute`
- `flash_attention_fast_softmax`
- `flash_attention_causal_mask`
Before submitting